Lab 5
November 6, 2025
Today’s goals. Learn about:
Workflow
Let’s start
The likelihood function
Let \(\mathcal{F}\) be a parametric statistical model for the data \(y\), where \(f(y;\theta)\) is the density or probability function and \(\theta \in \Theta\) is a \(p\)-dimensional parameter. Considering \(f(y;\theta)\) as a function of \(\theta\), with \(y\) fixed at the observed value, the likelihood function of \(\theta\) based on \(y\), \(L: \Theta \rightarrow \mathbb{R}^{+}\), is \[L(\theta) = L(\theta; y) \propto f_Y(y; \theta) \,\,, \] where the proportionality involves a constant \(c(y)>0\) that does not depend on \(\theta\).
On the basis of the data \(y\), \(\theta \in \Theta\) is more likely than \(\theta' \in \Theta\) as an index of the data-generating model if \(L(\theta) > L(\theta')\).
Note
Maximum likelihood estimation in the Binomial model
Let \(y=(y_{1},\ldots, y_{n})\) be a sample of i.i.d. values from a Bernoulli distribution, \(Y \sim \mbox{Be}(p)\). Then the likelihood function is
\[L(p) = \prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i} \] and the log-likelihood function takes the form \[\ell(p) = \sum_{i=1}^{n} \big[ y_i \log (p) + (1-y_i) \log (1-p) \big].\] The maximum likelihood estimate, \(\hat p\), is the sample proportion \(\hat p =\frac{1}{n}\sum_{i=1}^{n}y_i\). To derive it, we obtain the score function \[\ell_{\star}(p)=U(p)=\frac{\partial \ell(p)}{\partial p}= \sum_{i=1}^{n} \bigg(\frac{y_i}{p} - \frac{1-y_i}{1-p}\bigg) = \frac{\sum_{i=1}^{n}y_i}{p} - \frac{n- \sum_{i=1}^{n}y_i}{1-p}.\] Setting the score function equal to zero yields the maximum likelihood estimate
\[U(p) = 0 \implies \hat p = \frac{1}{n}\sum_{i=1}^{n} y_i \]
Example
Suppose we generate a random sample \(y\) of size \(n=100\) from a Bernoulli distribution with parameter \(p=0.6\), and we want to make inference on \(p\).
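The original code chunk is not shown; here is a minimal sketch of the data generation (the seed value is an assumption, chosen only for reproducibility — the numerical results reported below come from the original, unstated seed):

```r
set.seed(123)  # hypothetical seed, not the one used in the original lab
n <- 100
p <- 0.6
y <- rbinom(n = n, size = 1, prob = p)  # i.i.d. Bernoulli(p) sample
```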
We want to write a function that takes the parameter and the data as arguments and returns the log-likelihood. We can do it in two ways:
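A possible implementation of the two versions (the names llik_bin and llik_bin2 are the ones used later in the lab; the exact bodies are a sketch):

```r
# Way 1: via dbinom(); correct for a scalar theta only, since a vector
# theta would be recycled across the observations
llik_bin <- function(theta, data) {
  sum(dbinom(x = data, size = 1, prob = theta, log = TRUE))
}

# Way 2: built manually from the analytic log-likelihood;
# vectorized in theta by construction
llik_bin2 <- function(theta, data) {
  sum(data) * log(theta) + (length(data) - sum(data)) * log(1 - theta)
}
```

For a scalar value of the parameter the two functions coincide.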
Example
We can check that both functions return the same value, for instance by fixing \(p=0.5\)
The maximum likelihood estimate is available in closed form (later we will see how to carry out ML estimation using numerical optimisation, which is particularly useful when the ML estimate cannot be obtained analytically), so
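In code (a sketch; MLEp is the name used later for the estimate, and the sample y is regenerated here with a hypothetical seed to keep the snippet self-contained):

```r
set.seed(123)               # hypothetical seed
y <- rbinom(100, 1, 0.6)    # simulated Bernoulli sample
MLEp <- mean(y)             # closed-form ML estimate: the sample proportion
MLEp
```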
Example
Note that above we used the function built manually, since that function is vectorized in the parameter, while the first one should be vectorized before being used in this way. Indeed
[1] -98.22045
[1] -69.31472 -69.41686 -171.02447
# The result above is due to this operation
sum(dbinom(x = y[seq(1,100,3)], size = 1, prob = 0.5, log = TRUE)) +
  sum(dbinom(x = y[seq(2,100,3)], size = 1, prob = 0.75, log = TRUE)) +
  sum(dbinom(x = y[seq(3,100,3)], size = 1, prob = 0.99, log = TRUE))
[1] -98.22045
# However, we can vectorize the function llik_bin by means of
llik_bin_v <- Vectorize(llik_bin, 'theta')
llik_bin_v(c(0.5, 0.75, 0.99), y) # Correct
[1] -69.31472 -69.41686 -171.02447
Example
Then we can plot it, including two vertical lines denoting where the true parameter value and the maximum likelihood estimate are located
par(mfrow = c(1, 2))
curve(llik_bin2(theta=x, data = y), 0, 1,
xlab= expression(p), ylab = expression(l(p)),
main = "Log-likelihood function")
abline(v = p, col = "red")
abline(v = MLEp, col = "blue")
legend("topleft", legend = c(expression(p), expression(hat(p))),
col = c("red", "blue"), lty = c(1, 1))
curve(llik_bin_v(theta=x, data = y), 0, 1,
xlab= expression(p), ylab = expression(l(p)),
main = "Log-likelihood function")
abline(v = p, col = "red")
abline(v = MLEp, col = "blue")
legend("topleft", legend = c(expression(p), expression(hat(p))),
col = c("red", "blue"), lty = c(1, 1))
Fisher information for the Binomial model
We know that \(\hat \theta \stackrel{\cdot} \sim \mathcal{N}(\theta,\mathcal{I}^{-1}(\theta) )\), where \(\mathcal{I}(\theta)\) is the expected information matrix, that is \(\mathcal{I}(\theta)=E(J(\theta; Y))\), with \(J(\theta; Y)\) the observed information matrix, i.e. the negative Hessian of the log-likelihood. In our one-dimensional parameter example \[J(p) = -\frac{\partial^2 \ell(p)}{\partial p^2} = \frac{\sum_{i=1}^{n}y_i}{p^2} + \frac{n-\sum_{i=1}^{n}y_i}{(1-p)^2}.\] Thus, since \(\sum_{i=1}^{n} {Y_i}\sim \mbox{Bin}(n,p)\), we have \[\mathcal{I}(p)=E[J(p; Y)]= \frac{\sum_{i=1}^{n}E[Y_i]}{p^2} + \frac{n-\sum_{i=1}^{n}E[Y_i]}{(1-p)^2} = \frac{n}{ p (1-p)}.\] Then \(\hat p \stackrel{\cdot}\sim \mathcal{N}\big(p, \frac{p(1-p)}{n}\big)\), where we can replace \(\mathcal{I}^{-1}(p)\) with the estimate \(i^{-1}(\hat p)=j^{-1}(\hat p)=\hat p (1- \hat p)/n\).
Logit reparametrization of the Binomial model
Let us consider the logit function \(\psi(p) = {\rm logit}(p)=\log\big(\frac{p}{1-p}\big)\), which gives the log-odds. We can use it to reparametrise the model so that the parameter space is unbounded. At this point the parameter is expressed on the logit scale, and we need to re-express it on the original scale via \(p(\psi) = \exp(\psi)/(1+\exp(\psi))\). We can leverage the invariance of the maximum likelihood estimator under reparametrisations, so the maximum likelihood estimate is simply \(\hat \psi = \log\big(\frac{\hat p}{1- \hat p}\big)\). Thus, we do not need to write down the log-likelihood under the reparametrisation to obtain the MLE of \(\psi\) from it.
Logit reparametrization of the Binomial model
However, we can easily visualize the log-likelihood function under the reparametrisation as
psi <- function(theta) log(theta/(1-theta))
theta <- function(psi) exp(psi)/(1+exp(psi))
llik_bin_rep <- function(param, data) llik_bin2(theta(param), data)
curve(llik_bin_rep(param = x, data = y), -10, 10, main = "Log-likelihood function",
xlab= expression(psi), ylab = expression(l(psi)))
abline(v = psi(p), col = "red")
abline(v = psi(MLEp), col = "blue")
legend("topleft", legend = c(expression(psi), expression(hat(psi))),
col = c("red", "blue"), lty = c(1, 1))
Logit reparametrization of the Binomial model
The asymptotic distribution of the ML estimator of \(\psi\) is given by \(\hat \psi \stackrel{\cdot} \sim \mathcal{N}(\psi, V(\hat \psi))\)
By using the delta method, we can easily obtain
\[ V(\hat \psi) = V(\hat p) \bigg(\frac{d}{d p} \psi (p) \bigg)^2 = V(\hat p) \bigg(\frac{d}{d p} \log\frac{p}{1-p}\bigg)^2 = \frac{ p (1 - p)}{n} \frac{1}{p^2(1-p)^2} = \frac{1}{np(1-p)}\] and a consistent estimate of such variance is \(\hat V(\hat \psi) = \frac{1}{n \hat p (1 - \hat p)}\)
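In code, the estimated standard error of \(\hat\psi\) can be computed as follows (a sketch; the value of MLEp is the rounded ML estimate from the Bernoulli example above, and V_psi_hat is a hypothetical name):

```r
n <- 100
MLEp <- 0.63                              # ML estimate from the example above (rounded)
V_psi_hat <- 1 / (n * MLEp * (1 - MLEp))  # delta-method variance estimate
sqrt(V_psi_hat)                           # estimated standard error of psi-hat
```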
set.seed(1234)
R <- 5000
p <- 0.6
n2 <- 1000
prop <- rep(0, R)
for(i in 1:R){
sample <- rbinom(n2, 1, p)
prop[i] <- mean(sample)
}
par(mfrow=c(1,2))
hist(prop, freq = F, breaks = 30, main = expression(p))
curve(dnorm(x, mean = p, sd = sqrt((p * (1 - p)/n2))),
lwd = 2, xlab = "", ylab = "", add = T)
hist(psi(prop), freq = F, breaks = 30, main = expression(psi))
curve(dnorm(x, mean = psi(p), sd = 1/sqrt((p * (1 - p) * n2) )),
lwd = 2, xlab = "", ylab = "", add = T)
R functions for numerical optimisation
In this example, we computed some quantities analytically and obtained numerical values simply by plugging the data into the formulas. However, recall that the MLE of a parameter may not have an analytic solution: often there is no closed form for the ML estimates and we need numerical optimisation. R provides various functions for performing numerical optimisation:
nlm(): minimizes a function using a Newton-type algorithm. It needs a starting value and does not allow constraints on the parameters. It is usually fast and reliable. It returns the ML estimate \(\hat{\theta}\) (estimate), the value of the negative log-likelihood \(-\ell(\hat{\theta})\) (minimum) and the Hessian (hessian), if hessian = TRUE.
optim(): minimizes a function using Nelder-Mead, quasi-Newton and conjugate-gradient algorithms. It includes an option for box-constrained optimization, and it requires a starting value. It returns the ML estimate \(\hat{\theta}\) (par), the value of the negative log-likelihood \(-\ell(\hat{\theta})\) (value) and the Hessian (hessian), if hessian = TRUE. You can also maximize the function by setting fnscale = -1 in the control argument.
nlminb(): often more stable, robust and reliable than optim() (in particular with “nasty” functions). It performs only minimization and does not yield numerical derivatives as output. It returns the ML estimate \(\hat{\theta}\) (par) and the value of the negative log-likelihood \(-\ell(\hat{\theta})\) (objective).
optimize(): searches an interval for a minimum (or a maximum, if maximum = TRUE) of a function, using a combination of golden-section search and successive parabolic interpolation. It returns the ML estimate \(\hat{\theta}\) (minimum) and the corresponding value of the objective function (objective). Drawback: suited only to a one-dimensional parameter.
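As a quick illustration of optimize() on the Bernoulli example (a sketch; the negative log-likelihood nllik_bin is implemented in a later section and is inlined here, with a hypothetical seed, to keep the snippet self-contained):

```r
# negative Bernoulli log-likelihood (inlined for self-containment)
nllik_bin <- function(theta, data) {
  -(sum(data) * log(theta) + (length(data) - sum(data)) * log(1 - theta))
}
set.seed(123)                    # hypothetical seed
y <- rbinom(100, 1, 0.6)
optimize(f = nllik_bin, interval = c(0, 1), data = y)
```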
Binomial example
Here, we are simply using the Binomial example to show how (some of) these functions work.
Note: In the following, we consider and implement the negative log-likelihood function, since some of the numerical optimisers that we will see can only perform minimisation of functions.
In addition, we implement the score and the observed information (used for comparison with the optimiser output).
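A possible implementation (nllik_bin is the name used in the later calls; score_bin and obs_info_bin are hypothetical names for the score and the observed information):

```r
# negative log-likelihood, to be minimised by the optimisers
nllik_bin <- function(theta, data) {
  -(sum(data) * log(theta) + (length(data) - sum(data)) * log(1 - theta))
}

# score function U(p)
score_bin <- function(theta, data) {
  sum(data) / theta - (length(data) - sum(data)) / (1 - theta)
}

# observed information J(p)
obs_info_bin <- function(theta, data) {
  sum(data) / theta^2 + (length(data) - sum(data)) / (1 - theta)^2
}
```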
Binomial example
To check if the score and the observed information are correctly implemented, we can compute the numerical gradient and hessian of the log-likelihood function, as implemented in the {numDeriv} R package.
Important
Note that in this case we obtain them for the negative log-likelihood function, implying that we need to change the sign of the score function.
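The check might look as follows (a sketch; it requires the numDeriv package, and re-defines nllik_bin and the sample, with a hypothetical seed, to stay self-contained):

```r
library(numDeriv)  # grad() and hessian() for numerical derivatives

nllik_bin <- function(theta, data) {
  -(sum(data) * log(theta) + (length(data) - sum(data)) * log(1 - theta))
}
set.seed(123)                    # hypothetical seed
y <- rbinom(100, 1, 0.6)

grad(func = nllik_bin, x = 0.5, data = y)     # should equal minus the score at 0.5
hessian(func = nllik_bin, x = 0.5, data = y)  # should equal the observed information at 0.5
```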
nlm optimiser
Let’s start with the first optimiser, nlm() (considering 0.5 as starting value, which is quite close to the MLE; however, see the warnings printed on the console)
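The call was presumably of the following form (a sketch; bin.nlm is a hypothetical object name, and nllik_bin and y are re-defined, with a hypothetical seed, to keep the snippet self-contained):

```r
nllik_bin <- function(theta, data) {
  -(sum(data) * log(theta) + (length(data) - sum(data)) * log(1 - theta))
}
set.seed(123)                    # hypothetical seed
y <- rbinom(100, 1, 0.6)

# nlm() minimises the negative log-likelihood; trial steps outside (0, 1)
# produce NaN, which nlm() replaces with a large value (hence the warnings)
bin.nlm <- nlm(f = nllik_bin, p = 0.5, data = y, hessian = TRUE)
bin.nlm
```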
$minimum
[1] 65.89557
$estimate
[1] 0.6299995
$gradient
[1] 1.421085e-08
$hessian
[,1]
[1,] 429.0957
$code
[1] 1
$iterations
[1] 5
[1] 65.89557
[1] 429
optim optimiser
The previous optimiser does not allow constraints on the parameters, although internal safeguards in the algorithm may still let it reach the solution. However, if you want to optimise a function over a constrained space, it is better to consider an optimiser that allows box-constrained optimisation, such as optim().
bin.optim_start1 <- optim(par = 0.5, fn = nllik_bin, hessian = TRUE,
data = y, method = "L-BFGS-B", lower = 1e-7, upper = 1-1e-7)
bin.optim_start1$par
[1] 0.6299996
$value
[1] 65.89557
$counts
function gradient
7 7
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
$hessian
[,1]
[1,] 429.0048
bin.optim_start2 <- optim(par = 0.001, fn = nllik_bin,
data = y, method = "L-BFGS-B", lower = 1e-7, upper =1-1e-7)
bin.optim_start2$par
[1] 0.6299996
$value
[1] 65.89557
$counts
function gradient
12 12
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
[,1]
[1,] 429.0048
Note
The optim() function provides many numerical methods (Nelder-Mead, quasi-Newton, conjugate-gradient methods and simulated annealing). As a drawback, the user has to set up the initial guesses and the method carefully, since the final solution may be quite sensitive to these choices. To compute the observed information, the function optimHess() computes numerical derivatives of generic functions.
As an alternative, we could consider again the reparametrisation introduced above. Generally, for numerical purposes, it is convenient to reparameterize the model so that the new parameter space is unbounded. We explore this by using nlm()
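A sketch of the reparametrised fit (nllik_bin_rep and bin.nlm_rep are hypothetical names; theta() is the inverse-logit helper defined earlier, repeated here for self-containment):

```r
theta <- function(psi) exp(psi) / (1 + exp(psi))  # inverse logit
nllik_bin_rep <- function(param, data) {
  p <- theta(param)
  -(sum(data) * log(p) + (length(data) - sum(data)) * log(1 - p))
}
set.seed(123)                    # hypothetical seed
y <- rbinom(100, 1, 0.6)

bin.nlm_rep <- nlm(f = nllik_bin_rep, p = 0, data = y, hessian = TRUE)
bin.nlm_rep$estimate             # estimate on the logit scale
theta(bin.nlm_rep$estimate)      # back-transformed to the original scale
```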
$minimum
[1] 65.89557
$estimate
[1] 0.5322163
$gradient
[1] 2.842171e-08
$hessian
[,1]
[1,] 23.3094
$code
[1] 1
$iterations
[1] 4
[1] 65.89557
[1] 0.5322163 0.5322147 0.5322168
[,1]
[1,] 23.31
The regression framework
Regression tries to describe a dependent variable \(\mathbf{y}\) through some relevant predictors \(\mathbf{x}\), linking them through a function \(f(\cdot)\) (also referred to as the systematic component): \[\mathbf{y} = f(\mathbf{x})+\boldsymbol{\varepsilon},\] where \(\boldsymbol{\varepsilon}\) is a measure of the noise implicit in modelling (also called the stochastic component).
Validating model assumptions
Checking the assumptions of a regression model is not just a theoretical matter: it relates to scientific reproducibility, scientific transparency and the power of visualization. Although it is impossible to write down an exhaustive list of ‘what to do for checking a model’, we give here some initial steps for visualising and inspecting your variables, as well as for inspecting possible correlations between them and the inclusion of further predictors. The suggested steps refer to the classical (normal) linear model, \(f(\mathbf{x}) = f(\mathbf{x}; \boldsymbol{\beta}) = \mathbf{X} \boldsymbol{\beta}\), that is
\[y_i=\beta_1x_{i1} + \beta_2x_{i2} + \ldots + \beta_px_{ip} +\varepsilon_i, \quad \varepsilon_i \stackrel{iid}\sim \mathcal{N}(0,\sigma^2),\] where we usually assume \(x_{i1}=1\); nonetheless, these steps contain principles that can be extended to statistical modelling in general.
Before fitting a model
After fitting a model
Example
We consider the dataset nihills available in the {DAAG} package, containing data from the 2007 calendar for the Northern Ireland Mountain Running Association. The number of observations (races) is \(n = 23\), and for each of them we register the following variables:
dist: the race distance
climb: the total climb
time: the male record time
timef: the female record time
Objective: we want to regress the times on some predictors.
Loading the data
In the following, we focus on the time results for males. Then, we load the data.
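A sketch of the loading step (requires the DAAG package):

```r
library(DAAG)   # provides the nihills dataset
data(nihills)
```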
The data structure
We explore the data structure, the first six and the last six rows.
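The commands producing the output below were presumably:

```r
library(DAAG)   # for the nihills data
data(nihills)
str(nihills)    # data structure
head(nihills)   # first six rows
tail(nihills)   # last six rows
```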
'data.frame': 23 obs. of 4 variables:
$ dist : num 7.5 4.2 5.9 6.8 5 4.8 4.3 3 2.5 12 ...
$ climb: int 1740 1110 1210 3300 1200 950 1600 1500 1500 5080 ...
$ time : num 0.858 0.467 0.703 1.039 0.541 ...
$ timef: num 1.064 0.623 0.887 1.214 0.637 ...
dist climb time timef
Binevenagh 7.5 1740 0.8583333 1.0644444
Slieve Gullion 4.2 1110 0.4666667 0.6230556
Glenariff Mountain 5.9 1210 0.7030556 0.8869444
Donard & Commedagh 6.8 3300 1.0386111 1.2141667
McVeigh Classic 5.0 1200 0.5411111 0.6375000
Tollymore Mountain 4.8 950 0.4833333 0.5886111
dist climb time timef
Slieve Bearnagh 4.0 2690 0.6877778 0.7991667
Seven Sevens 18.9 8775 3.9027778 5.9855556
Lurig Challenge 4.0 1000 0.4347222 0.5755556
Scrabo Hill Race 2.9 750 0.3247222 0.4091667
Slieve Gallion 4.6 1440 0.6361111 0.7494444
BARF Turkey Trot 5.7 1430 0.7130556 0.9383333
Summary
A summary report of the variables allows exploring some characteristics of the variables.
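Presumably obtained with:

```r
library(DAAG)
data(nihills)
summary(nihills)  # per-variable summaries
```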
dist climb time timef
Min. : 2.500 Min. : 750 Min. :0.3247 Min. :0.4092
1st Qu.: 4.000 1st Qu.:1205 1st Qu.:0.4692 1st Qu.:0.6158
Median : 4.500 Median :1500 Median :0.5506 Median :0.7017
Mean : 5.778 Mean :2098 Mean :0.8358 Mean :1.1107
3rd Qu.: 5.800 3rd Qu.:2245 3rd Qu.:0.7857 3rd Qu.:1.0014
Max. :18.900 Max. :8775 Max. :3.9028 Max. :5.9856
Exploring the relationship
In a first step, we consider only the predictor dist for regressing the dependent variable time. Let us have a quick look at the data by means of a scatterplot:
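A sketch of the scatterplot call (mirroring the plotting code used later in this lab):

```r
library(DAAG)
data(nihills)
with(nihills, plot(dist, time, pch = 19))  # time against distance
```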
Fitting the model
Apart from one, all the recorded times are lower than 2 hours. As an initial attempt, let’s fit a simple linear model (and see a basic model output): \[\texttt{time}_i =\beta_1+ \beta_2\texttt{dist}_i+\varepsilon_i, \ \ i=1,\ldots,n, \quad \varepsilon_i\stackrel{iid}\sim \mathcal{N}(0,\sigma^2).\]
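The fit, with the object name lm_dist used in the later code:

```r
library(DAAG)
data(nihills)
lm_dist <- lm(time ~ dist, data = nihills)  # simple linear model
lm_dist
```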
Detailed model output
A more detailed summary of the fitting procedure can be obtained by using the summary() function
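That is (lm_dist re-fitted here to keep the snippet self-contained):

```r
library(DAAG)
data(nihills)
lm_dist <- lm(time ~ dist, data = nihills)
summary(lm_dist)
```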
Call:
lm(formula = time ~ dist, data = nihills)
Residuals:
Min 1Q Median 3Q Max
-0.42812 -0.12193 -0.00250 0.09232 0.43028
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.32530 0.07629 -4.264 0.000345 ***
dist 0.20094 0.01120 17.934 3.28e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1935 on 21 degrees of freedom
Multiple R-squared: 0.9387, Adjusted R-squared: 0.9358
F-statistic: 321.6 on 1 and 21 DF, p-value: 3.278e-14
Understanding the summary output
Call: the fitted model.
Residuals: some residual summaries, mainly quantiles (asymmetry?).
Coefficients:
Estimate: coefficient estimates, \(\hat \beta_j, j = 1, \ldots, p\).
Std. Error: estimated standard deviation of the estimators \(\hat \beta_j\), that is \(\sqrt{ \widehat {V (\hat \beta_j)}}\).
t value: the (observed) values of the test statistic for testing the nullity of the coefficients, \(H_0: \beta_j=0\) against \(H_1: \beta_j\neq0\), that is \(t_j = \hat \beta_j / \sqrt{ \widehat {V (\hat \beta_j)}}\).
Understanding the summary output
Residual plots
Then, we can visualise some aspects of the fitted model. The first step, and likely the most important one, is to explore the residuals plots
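A sketch, using the default diagnostic plots of an lm object (lm_dist re-fitted for self-containment):

```r
library(DAAG)
data(nihills)
lm_dist <- lm(time ~ dist, data = nihills)

par(mfrow = c(2, 2))
plot(lm_dist)  # residuals vs fitted, normal Q-Q, scale-location, leverage
```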
Understanding residual plots
Plots (graphical assessment of certain linear model assumptions):
Visualizing the fitted model
Then, we can produce the scatterplot including the regression line
The fit seems good enough. However, the relationship between time and dist does not seem to be purely linear, and the residual plots seem to suggest a lack of fit for a few points.
par(mfrow = c(1, 1))
with(nihills, plot(dist, time, pch = 19))
abline(coef(lm_dist), col = "red", lty = "solid")
text(13, 3, expression(time == hat(beta)[0] + hat(beta)[1] * dist), col = "red")
points(nihills$dist, predict(lm_dist), col = "red", pch = 19, cex = 0.8)
nihills[17,]
segments(nihills[17,]$dist, nihills[17,]$time, nihills[17,]$dist, fitted(lm_dist)[17], lty = "dashed")
Computing model quantities by hand
Here, we obtain by hand the quantities in the output of the summary() function applied to an lm object. The computations are done using matrix algebra (even though we are working with a simple linear model), so the following lines of code can also be used to carry out the same checks for a multiple linear model.
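The computations shown below were presumably along these lines (a sketch; the names X, s2, residuals, n, p echo those appearing in the code fragments that follow):

```r
library(DAAG)
data(nihills)

X <- model.matrix(~ dist, data = nihills)     # design matrix
y <- nihills$time
n <- nrow(X); p <- ncol(X)

beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y  # OLS estimates
y.hat <- X %*% beta.hat                       # fitted values
residuals <- y - y.hat                        # residuals
s2 <- sum(residuals^2) / (n - p)              # estimate of sigma^2
```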
[,1]
(Intercept) -0.3252964
dist 0.2009417
(Intercept) dist
-0.3252964 0.2009417
[1] 3.108624e-15
[1] 2.720046e-15
V1
"Min. :-0.428118 " "-0.42812"
"1st Qu.:-0.121926 " "-0.12193"
"Median :-0.002496 " "-0.0025"
"Mean : 0.000000 " "0"
"3rd Qu.: 0.092318 " "0.09232"
"Max. : 0.430276 " "0.43028"
[1] 0.03744742
[1] 0.03744742
### Estimates of stdev of regression coefficient estimator
stdb <- sqrt(diag(s2 * solve(t(X) %*% X))); stdb
(Intercept)        dist
0.07628629 0.01120431
(Intercept) dist
0.07628629 0.01120431
[,1]
(Intercept) -4.264153
dist 17.934328
[1] -4.264153 17.934328
[,1]
(Intercept) 3.454758e-04
dist 3.278480e-14
[1] 3.454758e-04 3.278480e-14
[,1]
[1,] 0.8357971
beta.hat0 <- mean(y)
y.hat0 <- mean(y)
residuals0 <- y - y.hat0
### F-statistic
F <- ((sum(residuals0^2) - sum(residuals^2))/(p - p0))/(sum(residuals^2)/(n - p))
F
[1] 321.6401
[1] 321.6401
value numdf dendf
321.6401 1.0000 21.0000
[1] 3.27848e-14